Mozart exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Name Discrimination by Clustering Similar Contexts

Internal identifier: 001A66 (Main/Exploration); previous: 001A65; next: 001A67

Authors: Ted Pedersen [United States]; Amruta Purandare [United States]; Anagha Kulkarni [United States]

Source :

RBID : ISTEX:B56BF55C08AEFE31148251FE4682317E0EA1301E

Abstract

It is relatively common for different people or organizations to share the same name. Given the increasing amount of information available online, this results in an ever-growing possibility of finding misleading or incorrect information due to confusion caused by an ambiguous name. This paper presents an unsupervised approach that resolves name ambiguity by clustering the instances of a given name into groups, each of which is associated with a distinct underlying entity. The features we employ to represent the context of an ambiguous name are statistically significant bigrams that occur in the same context as the ambiguous name. From these features we create a co-occurrence matrix where the rows and columns represent the first and second words in bigrams, and the cells contain their log-likelihood scores. Then we represent each of the contexts in which an ambiguous name appears with a second-order context vector. This is created by taking the average of the vectors from the co-occurrence matrix associated with the words that make up each context. This creates a high-dimensional “instance by word” matrix that is reduced to its most significant dimensions by Singular Value Decomposition (SVD). The different “meanings” of a name are discriminated by clustering these second-order context vectors with the method of Repeated Bisections. We evaluate this approach by conflating pairs of names found in a large corpus of text to create ambiguous pseudo-names. We find that our method is significantly more accurate than the majority classifier, and that the best results are obtained by having a small amount of local context to represent the instance, along with a larger amount of context for identifying features, or vice versa.
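
The pipeline described in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy contexts, the zero score threshold, and the deterministic 2-means split used as a stand-in for the Repeated Bisections method are all hypothetical choices made here, and NumPy is assumed to be installed.

```python
import math
from collections import Counter

import numpy as np


def log_likelihood(n11, n1p, np1, npp):
    """G^2 score of a bigram from the counts of its 2x2 contingency table."""
    observed = [n11, n1p - n11, np1 - n11, npp - n1p - np1 + n11]
    expected = [n1p * np1 / npp, n1p * (npp - np1) / npp,
                (npp - n1p) * np1 / npp, (npp - n1p) * (npp - np1) / npp]
    return 2.0 * sum(o * math.log(o / e)
                     for o, e in zip(observed, expected) if o > 0)


def cooccurrence_matrix(contexts, threshold=0.0):
    """Rows are first words, columns are second words, cells are G^2 scores.

    The threshold is an assumption; a real run would pick a significance cutoff.
    """
    bigrams = Counter()
    for tokens in contexts:
        bigrams.update(zip(tokens, tokens[1:]))
    first, second = Counter(), Counter()
    for (w1, w2), n in bigrams.items():
        first[w1] += n
        second[w2] += n
    total = sum(bigrams.values())
    rows = {w: i for i, w in enumerate(sorted(first))}
    cols = {w: j for j, w in enumerate(sorted(second))}
    matrix = np.zeros((len(rows), len(cols)))
    for (w1, w2), n in bigrams.items():
        score = log_likelihood(n, first[w1], second[w2], total)
        if score >= threshold:
            matrix[rows[w1], cols[w2]] = score
    return matrix, rows


def context_vectors(contexts, matrix, rows):
    """Second-order vector: average of the matrix rows of a context's words."""
    vectors = []
    for tokens in contexts:
        idx = [rows[t] for t in tokens if t in rows]
        vectors.append(matrix[idx].mean(axis=0) if idx
                       else np.zeros(matrix.shape[1]))
    return np.array(vectors)


def reduce_dimensions(vectors, k):
    """Keep the k most significant dimensions found by SVD."""
    u, s, _ = np.linalg.svd(vectors, full_matrices=False)
    k = min(k, len(s))
    return u[:, :k] * s[:k]


def bisect(points, iterations=20):
    """Deterministic 2-means split (simplified stand-in for one bisection)."""
    a = points[0]
    b = points[np.argmax(np.linalg.norm(points - a, axis=1))]
    centers = np.array([a, b])
    for _ in range(iterations):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in (0, 1):
            if np.any(labels == c):
                centers[c] = points[labels == c].mean(axis=0)
    return labels


def repeated_bisections(points, n_clusters):
    """Repeatedly split the largest cluster until n_clusters are reached."""
    labels = np.zeros(len(points), dtype=int)
    for new_label in range(1, n_clusters):
        largest = np.bincount(labels).argmax()
        members = np.flatnonzero(labels == largest)
        if len(members) < 2:
            break
        split = bisect(points[members])
        labels[members[split == 1]] = new_label
    return labels


# Toy contexts of one hypothetical pseudo-name (the name itself is left out).
contexts = [
    "composer wrote symphony".split(),
    "composer wrote opera".split(),
    "striker scored goal".split(),
    "striker scored penalty".split(),
]
matrix, rows = cooccurrence_matrix(contexts)
reduced = reduce_dimensions(context_vectors(contexts, matrix, rows), k=2)
labels = repeated_bisections(reduced, n_clusters=2)
print(labels)
```

In this toy run the two music contexts and the two football contexts share no vocabulary, so their second-order vectors are orthogonal and the split separates them. Real data would need a significance threshold on the scores and a proper Repeated Bisections implementation.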

Url:
DOI: 10.1007/978-3-540-30586-6_24


Affiliations:



The document in XML format

<record>
<TEI wicri:istexFullTextTei="biblStruct">
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Name Discrimination by Clustering Similar Contexts</title>
<author>
<name sortKey="Pedersen, Ted" sort="Pedersen, Ted" uniqKey="Pedersen T" first="Ted" last="Pedersen">Ted Pedersen</name>
</author>
<author>
<name sortKey="Purandare, Amruta" sort="Purandare, Amruta" uniqKey="Purandare A" first="Amruta" last="Purandare">Amruta Purandare</name>
</author>
<author>
<name sortKey="Kulkarni, Anagha" sort="Kulkarni, Anagha" uniqKey="Kulkarni A" first="Anagha" last="Kulkarni">Anagha Kulkarni</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:B56BF55C08AEFE31148251FE4682317E0EA1301E</idno>
<date when="2005" year="2005">2005</date>
<idno type="doi">10.1007/978-3-540-30586-6_24</idno>
<idno type="url">https://api.istex.fr/document/B56BF55C08AEFE31148251FE4682317E0EA1301E/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">001290</idno>
<idno type="wicri:Area/Istex/Curation">000F78</idno>
<idno type="wicri:Area/Istex/Checkpoint">001398</idno>
<idno type="wicri:doubleKey">0302-9743:2005:Pedersen T:name:discrimination:by</idno>
<idno type="wicri:Area/Main/Merge">001A95</idno>
<idno type="wicri:Area/Main/Curation">001A66</idno>
<idno type="wicri:Area/Main/Exploration">001A66</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title level="a" type="main" xml:lang="en">Name Discrimination by Clustering Similar Contexts</title>
<author>
<name sortKey="Pedersen, Ted" sort="Pedersen, Ted" uniqKey="Pedersen T" first="Ted" last="Pedersen">Ted Pedersen</name>
<affiliation wicri:level="2">
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>University of Minnesota, 55812, Duluth, MN</wicri:regionArea>
<placeName>
<region type="state">Minnesota</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Purandare, Amruta" sort="Purandare, Amruta" uniqKey="Purandare A" first="Amruta" last="Purandare">Amruta Purandare</name>
<affiliation wicri:level="4">
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>University of Pittsburgh, 15260, Pittsburgh, PA</wicri:regionArea>
<placeName>
<region type="state">Pennsylvanie</region>
</placeName>
<orgName type="university">Université de Pittsburgh</orgName>
</affiliation>
</author>
<author>
<name sortKey="Kulkarni, Anagha" sort="Kulkarni, Anagha" uniqKey="Kulkarni A" first="Anagha" last="Kulkarni">Anagha Kulkarni</name>
<affiliation wicri:level="2">
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>University of Minnesota, 55812, Duluth, MN</wicri:regionArea>
<placeName>
<region type="state">Minnesota</region>
</placeName>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series>
<title level="s">Lecture Notes in Computer Science</title>
<imprint>
<date>2005</date>
</imprint>
<idno type="ISSN">0302-9743</idno>
<idno type="eISSN">1611-3349</idno>
</series>
<idno type="istex">B56BF55C08AEFE31148251FE4682317E0EA1301E</idno>
<idno type="DOI">10.1007/978-3-540-30586-6_24</idno>
<idno type="ChapterID">Chap24</idno>
<idno type="ChapterID">24</idno>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
<langUsage>
<language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">Abstract: It is relatively common for different people or organizations to share the same name. Given the increasing amount of information available online, this results in the ever growing possibility of finding misleading or incorrect information due to confusion caused by an ambiguous name. This paper presents an unsupervised approach that resolves name ambiguity by clustering the instances of a given name into groups, each of which is associated with a distinct underlying entity. The features we employ to represent the context of an ambiguous name are statistically significant bigrams that occur in the same context as the ambiguous name. From these features we create a co–occurrence matrix where the rows and columns represent the first and second words in bigrams, and the cells contain their log–likelihood scores. Then we represent each of the contexts in which an ambiguous name appears with a second order context vector. This is created by taking the average of the vectors from the co–occurrence matrix associated with the words that make up each context. This creates a high dimensional “instance by word” matrix that is reduced to its most significant dimensions by Singular Value Decomposition (SVD). The different “meanings” of a name are discriminated by clustering these second order context vectors with the method of Repeated Bisections. We evaluate this approach by conflating pairs of names found in a large corpus of text to create ambiguous pseudo-names. We find that our method is significantly more accurate than the majority classifier, and that the best results are obtained by having a small amount of local context to represent the instance, along with a larger amount of context for identifying features, or vice versa.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>États-Unis</li>
</country>
<region>
<li>Minnesota</li>
<li>Pennsylvanie</li>
</region>
<orgName>
<li>Université de Pittsburgh</li>
</orgName>
</list>
<tree>
<country name="États-Unis">
<region name="Minnesota">
<name sortKey="Pedersen, Ted" sort="Pedersen, Ted" uniqKey="Pedersen T" first="Ted" last="Pedersen">Ted Pedersen</name>
</region>
<name sortKey="Kulkarni, Anagha" sort="Kulkarni, Anagha" uniqKey="Kulkarni A" first="Anagha" last="Kulkarni">Anagha Kulkarni</name>
<name sortKey="Purandare, Amruta" sort="Purandare, Amruta" uniqKey="Purandare A" first="Amruta" last="Purandare">Amruta Purandare</name>
</country>
</tree>
</affiliations>
</record>
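
The affiliation tree of the record can be read programmatically. The sketch below uses Python's standard `xml.etree.ElementTree` on a shortened copy of the `<affiliations>` element. The fragment, the function name, and the choice to group authors by country are assumptions made for illustration. The full record as listed here uses `wicri:` prefixes without a namespace declaration, so a namespace-aware parser would likely reject it unless a declaration is added.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the <affiliations> element from the record
# (sort keys omitted for brevity).
FRAGMENT = """
<affiliations>
  <tree>
    <country name="États-Unis">
      <region name="Minnesota">
        <name first="Ted" last="Pedersen">Ted Pedersen</name>
      </region>
      <name first="Anagha" last="Kulkarni">Anagha Kulkarni</name>
      <name first="Amruta" last="Purandare">Amruta Purandare</name>
    </country>
  </tree>
</affiliations>
"""


def authors_by_country(xml_text):
    """Map each <country name="..."> in the tree to its author names."""
    root = ET.fromstring(xml_text.strip())
    result = {}
    for country in root.iter("country"):
        # Only the <tree> countries carry a name attribute.
        if "name" in country.attrib:
            result[country.get("name")] = [n.text for n in country.iter("name")]
    return result


authors = authors_by_country(FRAGMENT)
print(authors)
```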

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/Musique/explor/MozartV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001A66 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 001A66 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Wicri/Musique
   |area=    MozartV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     ISTEX:B56BF55C08AEFE31148251FE4682317E0EA1301E
   |texte=   Name Discrimination by Clustering Similar Contexts
}}

This area was generated with Dilib version V0.6.20.
Data generation: Sun Apr 10 15:06:14 2016. Site generation: Tue Feb 7 15:40:35 2023